For my project I am using the Udacity-provided data set on red wine quality. The data set includes 1599 observations of 12 variables (the structure output below reports 13 because it also counts an unnamed row-index column). The 12 variables include 11 input variables, which constitute a variety of factors that go into the output variable, quality. Some of the input variables include acidity, sugar, sulfur dioxide, density, pH and alcohol. The quality variable is measured on a scale from 0 to 10, with 10 being the highest score.
Before I start with some univariate plots, I’m going to run some summary statistics to get a general overview of the variables.
## 'data.frame': 1599 obs. of 13 variables:
## $ : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Not a lot jumps out from these statistics since I’m not a wine or chemistry expert. That said, it is interesting that while the quality rating goes from 0 - 10, the min and max of that variable in this data set are 3 and 8. In addition, a quick survey of the summary data shows that MOST medians and means are relatively close together, which suggests roughly symmetric, close-to-normal distributions. The two variables where that is not the case are chlorides (to a small degree) and total.sulfur.dioxide (to a greater degree).
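The mean-versus-median comparison can be made a little more systematic. The sketch below copies the means and medians from the summary output above into a small data frame and ranks the variables by the relative gap between the two. The `stats` and `rel_gap` names are my own, and the relative gap is only a rough screening measure, not a formal skewness test.

```r
# Means and medians copied by hand from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42, 5.636),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20, 6.000)
)

# Relative gap: positive values mean the mean sits above the median,
# which is typical of a long right tail.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest absolute gaps first.
stats <- stats[order(-abs(stats$rel_gap)), ]
print(stats, row.names = FALSE)
```

On this measure total.sulfur.dioxide comes out on top, which matches the observation above.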
Now I will create histograms to explore the distribution of these variables. For these univariate plots, I am going to break the plots and discussions into 3 groups: (1) Acids/Acidity, (2) Chlorides/Sulphates/Sulfur dioxides and (3) Sugar/Density/Alcohol/Quality.
First, let’s look at the summary statistics for these variables:
From these histograms we can see that all but citric acid (g / dm^3) have a relatively normal distribution. To try and get a more normal distribution I transformed the citric acid variable using a square root scale, which generates a slightly more normal distribution. But since there are still a large number of wines in this data set with no citric acid, there is a large count at the zero point on the x-axis, which distorts the normal shape.
For these four variables, we see a variety of distributions. Chlorides and sulphates have a more normal distribution, with some high-level outliers. The two sulfur dioxide variables have long-tailed, right-skewed distributions.
The use of boxplots will help illuminate the issue of outliers within the chlorides and sulphates variables.
Here we can see just how many and how significant some of the outliers are for these two variables. For chlorides the median is just below .08 (g / dm^3) and the mean is .087, but we can see a number of observations above .2, with one above .6. Similarly for sulphates, we see a mean and median of .66 and .62 (g / dm^3) respectively. But in the boxplot there is a cluster of outliers above 1.0, with some going as high as 2.0.
Returning to the original histograms, both sulphates and chlorides have a relatively normal distribution when ignoring the outliers. To help focus on the normal distribution and remove some of the larger outliers, I will transform these two histograms by cutting the top 5% of observations.
Both of these now look a little better. Cutting the top 5% of sulphate and chloride observations generates a relatively normal distribution.
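Cutting the top 5% comes down to finding the 95th percentile and keeping everything at or below it. Here is a minimal base R sketch of that idea. The `sulphates` vector is simulated stand-in data (an assumption, not the wine file), and `cutoff` and `trimmed` are names I made up for illustration.

```r
set.seed(42)

# Simulated right-skewed stand-in for the sulphates column (not the real data).
sulphates <- rlnorm(1599, meanlog = log(0.62), sdlog = 0.25)

# The 95th percentile is the value below which 95% of observations fall.
cutoff <- quantile(sulphates, probs = 0.95)

# Keep only observations at or below the cutoff.
trimmed <- sulphates[sulphates <= cutoff]

length(sulphates)  # original count
length(trimmed)    # count after trimming
range(trimmed)     # the extreme upper values are gone
```

Note that this drops observations rather than fixing them, so any statistic computed on `trimmed` describes only the lower 95% of wines.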
Transforming the two sulfur dioxide variables using log10 does help change the plot to a more normal distribution. For the free sulfur dioxide, the log10 transformation produces something close to a bi-modal distribution. The log10 transformation really helps with the total sulfur dioxide variable, as there is now a nice normal distribution with most wines between 10 and 100 (mg / dm^3).
For the residual sugar variable, most wines have between 1 and 3 (g / dm^3), but there are also a handful of wines with extremely high sugar counts. This seems reasonable, as certain fruitier wines can have a sweeter taste. Density has a nice normal distribution with most wines falling between .995 and 1 (g / cm^3). The lack of diversity in density makes me think this is not a major factor in determining a wine's quality. Alcohol has a slightly long-tailed, right-skewed distribution. That said, there is a large percentage of wines between 9.3 and 9.5 (% by volume).
Finally we have quality, the outcome variable. As mentioned before, while quality is on a 0 - 10 scale, all the wines in this data set are between 3 and 8. Looking at the histogram, we see the majority of the wines are in the 5 and 6 category. Looking at the distribution, I would like to group the quality variable into 3 categories for future analysis: Low (3 or 4 quality), Medium (5 or 6 quality) and High (7 or 8 quality). Having these three groups will help with bi-variate and multi-variate analysis later in this report.
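To see why a log10 transformation helps with a long right tail, the sketch below compares a simple moment-based skewness before and after the transformation. The data are simulated (a lognormal stand-in for total sulfur dioxide, which is an assumption), and `skew` is a small helper I wrote for illustration rather than a function from a package.

```r
set.seed(7)

# Simulated long-tailed stand-in for total.sulfur.dioxide (not the real data).
total_so2 <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

# log10 compresses large values much more than small ones.
log_so2 <- log10(total_so2)

# Simple moment-based skewness: about 0 for a symmetric distribution,
# positive for a long right tail.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

c(raw = skew(total_so2), log10 = skew(log_so2))
```

The raw values show clearly positive skewness, while the log10 values sit near zero, which is the same effect seen in the transformed histogram.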
##
## (2,4] (4,6] (6,8]
## 63 1319 217
As shown in the histogram and table, most wines in the data set are in the Medium category (82%), with significantly smaller percentages of Low (4%) and High (14%) quality wines. This seems logical.
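The interval labels in the table above are what `cut()` produces by default, so the grouping can be sketched as follows. The short `quality` vector is toy data for illustration, and the Low/Medium/High labels are an optional variant that the output above does not use. The counts at the end are copied from the table.

```r
# Toy quality scores for illustration (not the real column).
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Breaks at 2, 4, 6, 8 give the right-closed intervals (2,4], (4,6], (6,8].
quality_scale <- cut(quality, breaks = c(2, 4, 6, 8))
table(quality_scale)

# Optional variant with readable labels.
quality_named <- cut(quality, breaks = c(2, 4, 6, 8),
                     labels = c("Low", "Medium", "High"))
table(quality_named)

# Percentages from the counts reported in the table above.
counts <- c(Low = 63, Medium = 1319, High = 217)
pct <- round(100 * counts / sum(counts))
pct
```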
## 'data.frame': 1599 obs. of 14 variables:
## $ : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality_scale : Factor w/ 3 levels "(2,4]","(4,6]",..: 2 2 2 2 2 2 2 3 3 2 ...
The original data set includes 1599 observations of 12 variables. There are no categorical/factored variables in the data set. All variables are numeric except quality, which is an integer type.
Clearly the most important feature is quality. The crux of this data exploration will be to see how the various variables impact the quality rating of a wine. Since such small percentages of the wines fall into the Low and High quality categories, it will be interesting to see what variables have the biggest effect on wine quality in those categories.
Based on a review of the data set documentation, all of these variables play some role in how a wine tastes and therefore in the quality of the wine. After reading the description of the attributes and pulling from my limited knowledge of wine, I think the variables that will have the biggest impact will be volatile acidity (can cause vinegar taste), residual sugar (level of sweetness), pH (how acidic or basic a wine is), chlorides (saltiness) and alcohol content.
Yes, I created a new, categorical variable from the quality variable called Quality Scale. For this variable, I took the quality variable and divided it into Low (3-4 quality), Medium (5-6) and High (7-8) categories.
The chlorides and sulphates variables were interesting because they both had somewhat normal distributions, with a number of scattered data points in the very high range of the distribution. For those two variables I stripped the top 5% of values.
The free and total sulfur dioxide variables were also interesting because they both had long-tailed distributions. I transformed those by scaling them with log10. The result was that total sulfur dioxide became somewhat normal, while the free sulfur dioxide variable became somewhat bi-modal.
Based on a review of the data set, it can be assumed that all variables in this data set are in some way correlated with wine quality. Rather than trying to plot each variable and do individual correlation tests, I am first going to do a plot matrix to try and highlight the 3 or 4 variables that are most correlated with quality.
Now we have something to work with! From the Udacity lesson we learned that 0.3 or -0.3 is the cut off for a meaningful (but VERY SMALL) correlation, .5 or -.5 is the cut off for a MODERATE correlation and .7 or -.7 is the cut off for a LARGE correlation. Using this information there are VERY FEW notable correlations for wine quality. The only two variables with a meaningful correlation are alcohol (.476) and volatile acidity (-.391).
There also appear to be some other meaningful correlations, but they are hard to pick out in this plot matrix. To help explore these relationships, I am going to use another plot matrix, this one from the “corrplot” package.
Using the corrplot matrix, the relevant correlations really pop out! Beyond the two mentioned above, other notable correlations are between pH and fixed acidity (-.68) and citric acid and fixed acidity (.67). The final, and maybe most interesting, non-quality correlation is between density and fixed acidity (.668).
Now I will generate boxplots (for those involving quality) and scatter plots for each of the five most relevant bi-variate relationships from the data set. Each of the scatter plots will also have a linear model smoother line to help illustrate the positive (red) or negative (blue) relationship between the variables. These plots will help visualize the correlations highlighted in the plot matrix.
For the alcohol vs. quality boxplot (CC: .476) we see a significant distinction in the alcohol content for wines with a 7 or 8 quality rating. The high quality wines have mean and median alcohol levels above the third quartile for medium and lower quality wines. Looking at the plot of volatile acidity and quality (CC: -.391), we see a downward trend for volatile acidity as quality improves. Again, higher quality wines have a median below the first quartile of medium quality wines. We also see a much larger spread of volatile acidity for low quality wines (3 and 4 levels), with the spread narrowing as wine quality increases.
For the three scatterplots we observe a negative relationship between fixed acidity and pH (CC: -.68), and positive relationships for citric acid v. fixed acidity (CC: .67) and density v. fixed acidity (CC: .668). A more detailed analysis of these relationships is discussed in the Bivariate Analysis section below.
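For a single pair of variables, the coefficient and its significance can be checked directly with `cor()` and `cor.test()`. The sketch below uses simulated data (an assumption, not the wine file) and a small `strength` helper that encodes the 0.3 / 0.5 / 0.7 cut offs from the Udacity lesson. The helper name and the "negligible" label are mine.

```r
set.seed(123)

# Simulated stand-in data with a modest positive relationship.
n <- 500
alcohol <- rnorm(n, mean = 10.4, sd = 1)
quality <- 0.36 * alcohol + rnorm(n, sd = 0.7)

# Pearson correlation coefficient and its significance test.
r <- cor(alcohol, quality)
test <- cor.test(alcohol, quality)
r
test$p.value

# Label a coefficient using the 0.3 / 0.5 / 0.7 cut offs from the lesson.
strength <- function(r) {
  a <- abs(r)
  if (a >= 0.7) "large"
  else if (a >= 0.5) "moderate"
  else if (a >= 0.3) "small"
  else "negligible"
}

# Applied to coefficients reported in this section.
sapply(c(alcohol = 0.476, volatile.acidity = -0.391, pH.fixed.acidity = -0.68),
       strength)
```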
In terms of the variables contributing to the quality of wine, I was surprised that so few, only two, had even a small correlation. As for the other significant relationships, they all make sense. Items with a low pH (0-3) are very acidic (e.g. lemon juice, vinegar), moving toward very neutral items (milk etc.) in the 6-8 pH range. Therefore, a negative correlation between fixed acidity and pH makes complete sense. Conversely, high fixed acidity positively correlating with citric acid also makes sense, for the same reason explained for the pH correlation.
The positive correlation between fixed acidity and density is a little less obvious to me. But some preliminary research into organic chemistry shows that acids are denser than non-acid chemicals. For example, tartaric acid, which occurs naturally in most plants, is common in wine and is the acid used for the fixed acidity measure, has a density of 1.79 g/cm^3, while water has a density of 1 g/cm^3. With this new information, the positive correlation between fixed acidity and density makes complete sense.
In terms of the main variable of interest, quality, the strongest relationship is with alcohol (.476). For non-key variables, the strongest relationship is between fixed acidity and pH (-.68).
Due to the lack of categorical variables, beyond the one I created for quality (quality_scale), multivariate analysis around the outcome variable, quality, is somewhat difficult. That said, there are still some interesting plots to explore.
Based on a review of the data set information and the plot matrix in my bivariate analysis, I identified volatile acidity, citric acid, alcohol, sulphates and chlorides as key factors in wine taste and quality. To better understand these variables and how wines in the three quality_scale groups (Low, Medium and High) differ in their relationships, I am going to develop some scatter plots for combinations of these variables and color the dots based on the wine quality scale I developed.
For the citric acid v. volatile acid plot, I can see a lot of blue (High Quality) points around the .4 g / dm^3 line of volatile acidity and between .25 and .75 g / dm^3 of citric acid. I can also see a somewhat downward trend in volatile acid as citric acid increases for the medium quality wines. Finally, I notice a number of Low Quality wines with very high volatile acidity levels and low alcohol levels. This falls in line with the correlation coefficients for these variables and quality discussed earlier.
Unfortunately, it is a little harder to see trends in the other two plots. The sulphates v. chlorides plot is really clustered in the lower left corner of the graph, with no clear distinctions across the wine quality groups. In the final plot I see a lot of high quality wines in the higher alcohol and lower volatile acidity section of the graph. As these are the two variables most highly correlated with quality, I will return to explore this plot more in the Final Plots section later in the report.
To help with the multivariate analysis, I built a linear regression model utilizing the four variables that have a correlation coefficient with quality above .2 in absolute value. The model will include volatile acidity (-.39), citric acid (.23), sulphates (.25) and alcohol (.48).
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red_wine_cor_set)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red_wine_cor_set)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid,
## data = red_wine_cor_set)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid +
## sulphates, data = red_wine_cor_set)
##
## ================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.055*** 2.646***
## (0.175) (0.184) (0.194) (0.201)
## alcohol 0.361*** 0.314*** 0.314*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.343*** -1.265***
## (0.095) (0.114) (0.113)
## citric.acid 0.068 -0.079
## (0.103) (0.104)
## sulphates 0.696***
## (0.103)
## ----------------------------------------------------------------
## R-squared 0.227 0.317 0.317 0.336
## adj. R-squared 0.226 0.316 0.316 0.334
## sigma 0.710 0.668 0.668 0.659
## F 468.267 370.379 246.976 201.777
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1621.596 -1599.093
## Deviance 805.870 711.796 711.603 691.852
## AIC 3448.114 3251.628 3253.192 3210.186
## BIC 3464.245 3273.136 3280.078 3242.448
## N 1599 1599 1599 1599
## ================================================================
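The nested structure of m1 through m4 can be sketched in base R as below. The `wine` data frame is simulated stand-in data with made-up distributions (an assumption, not the real `red_wine_cor_set`), so the numbers it prints will not match the table above. Only the model formulas follow the calls shown.

```r
set.seed(2024)

# Simulated stand-in data; the distributions and effect sizes are illustrative only.
n <- 1599
wine <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  citric.acid      = runif(n, 0, 0.8),
  sulphates        = rnorm(n, 0.66, 0.17)
)
wine$quality <- 2.6 + 0.31 * wine$alcohol - 1.27 * wine$volatile.acidity +
  0.70 * wine$sulphates + rnorm(n, sd = 0.66)

# Build the models by adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + citric.acid)
m4 <- update(m3, . ~ . + sulphates)

# R-squared for each model; it can only stay flat or rise as predictors are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4),
             function(m) summary(m)$r.squared)
round(r2, 3)

# F-tests comparing each model with the previous one.
anova(m1, m2, m3, m4)
```

Because R-squared never falls when a predictor is added, adjusted R-squared, AIC and BIC from the table are the fairer guides to whether a new variable earns its place.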
The information on the attributes of the variables in the wine data set says volatile acidity is “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. This does not seem like a good feature for a wine. At the same time, citric acid can add a freshness or fruity note to wines, which can be a good thing. This relationship plays out very nicely in the volatile acidity vs. citric acid scatter plot, especially when we look at the different color groups for low, medium and high quality wines. Not all wines have citric acid, so all groups have points at zero on the x-axis. But the volatile acidity points for lower quality wines start appearing consistently at a much higher level and stay that way as the levels of citric acid increase.
As mentioned before, the sulphates vs. chlorides plot was a little less interesting, as so many points were crammed together within a relatively small scale of sulphates and chlorides. Even when using alpha to help reduce overcrowding, there is not a lot that jumps out.
The final plot is of alcohol and volatile acidity, the two variables with the highest correlation coefficient for quality. The points are a little all over the place, with low quality wines typically having a high level of volatile acidity throughout the increase in alcohol per volume.
Sadly, not really. I was hoping for some stuff to jump out from these plots in terms of seeing some clear distinctions between the three quality groups. I’m hoping some jiggering and adjusting of the volatile acidity v. alcohol graph in the final section might illuminate why these variables are the most prominent features in wine quality for this data set.
For my analysis I created a linear regression model to predict quality based on the four variables with the highest individual correlation coefficients for quality. Overall, for a four-variable linear regression model with a relatively small number of observations, the model is not bad. We see from the output that all the coefficients except citric acid are statistically significant at the 99% level or better (three stars in the table) within the full model (m4). In addition, the R-squared and adjusted R-squared are both about .33, which means that these four variables explain about 33% of the variability in wine quality. Seeing as how subjective wine quality can be, getting that high an R-squared with a small model is pretty good in my opinion.
As for weaknesses of the model overall, I really wish the data set was larger. I think having only 1599 observations, and the fact that quality only ranged from 3-8 (on a 0-10 scale), hurts the model slightly. More observations might help establish more variables as being key factors in determining a wine's overall quality. As for the specific model, while citric acid had a correlation coefficient of .23 for quality, it is the only variable in the model without a statistically significant coefficient in the linear model. The lack of predictive power in citric acid is also evident in the fact that adding it to the model did not change the R-squared or adjusted R-squared. Therefore, given more time I might consider looking for alternative variables for the linear regression model.
For my first plot I wanted to illustrate how the three wine quality groups differ in relation to volatile acidity. While common in almost all wine, increasing levels of volatile acidity (in this data set it is acetic acid) will give wine a “stale” or vinegar-like taste. This is most definitely not a good feature to have for a high quality wine. The histogram here does a good job of showing how high quality wines cluster in the low volatile acidity range (.22 - .42 g / dm^3), with very few high quality wines above .72 g / dm^3. In contrast, there are a large number of low and medium quality wines above .72, with a number of wines in the .92 - 1.32 g / dm^3 range. Clearly, having a low volatile acidity level is important when trying to make a high quality wine.
I really like plot matrices. They help me get both a visual and numerical grasp on variables in a data set and their relationships. With that in mind I wanted to create a more comprehensive plot matrix that focuses more on numerical correlation coefficients and less on the plots. My second plot is a plot matrix developed with the corrplot package. The matrix analyzes the correlation of the 12 variables to each other. Each coordinate is a colored square, where the color indicates a positive or negative correlation and the darkness of the color indicates the level of correlation (0 - 1/-1). In addition, I also used the Intro to CorrPlot Package website to compute the significance/p-value of each correlation coefficient. I used a 95% significance level, so any correlation coefficient that is not significant at that level is x’d out. Looking at this graph really helps draw attention to the key relationships among the variables, which was important in helping me figure out which variables to plot and which to use in my linear regression model.
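The significance masking can be reproduced without any plotting package by running `cor.test()` on every pair of columns. The sketch below does that on a tiny simulated data frame. The `cor_pmat` helper and the column names are hypothetical, and this is my own base R approximation rather than the code from the corrplot introduction.

```r
set.seed(99)

# Small simulated data frame: a and b are related, c is independent noise.
n <- 300
x <- rnorm(n)
df <- data.frame(a = x, b = -0.7 * x + rnorm(n, sd = 0.7), c = rnorm(n))

# Matrix of p-values, one cor.test() per pair of columns.
cor_pmat <- function(d) {
  k <- ncol(d)
  p <- matrix(NA_real_, k, k, dimnames = list(names(d), names(d)))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      p[i, j] <- p[j, i] <- cor.test(d[[i]], d[[j]])$p.value
    }
  }
  diag(p) <- 0  # a variable is perfectly correlated with itself
  p
}

r_mat <- cor(df)
p_mat <- cor_pmat(df)

# Blank out coefficients that are not significant at the 95% level.
r_masked <- r_mat
r_masked[p_mat > 0.05] <- NA
round(r_masked, 2)
```

As far as I know, `corrplot()` accepts a matrix like `p_mat` through its `p.mat` argument together with `sig.level`, which is what produces the x’d out squares.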
I wanted to include a multivariate graph for my final plot. I selected this volatile acid v. alcohol scatter plot because these are the two variables with the strongest correlation to wine quality. My initial scatter plot, which I included in the multivariate plot section above, was good but a little hard to interpret due to overplotting. In this version I jittered the points to help visualize them better and added linear regression smoother lines. I think the smoother lines really help illustrate the differences between the wine quality groups in this graph. The line for high quality wines shows a consistent, low volatile acidity level across all alcohol levels. In contrast, the low quality wines have a much higher volatile acidity level that increases slightly as alcohol increases (although that increase is probably due to some low quality wines with very high volatile acid levels in the 10.5% - 12% alcohol range). Finally, the medium quality wines have a slight negative slope in their linear regression line. I am not really sure why this occurs, but the fact that higher alcohol content is correlated with high quality wine may be a factor.
Overall, I was pleasantly surprised with the data set and the insight gathered. As I mentioned earlier, I am not a big wine connoisseur, so working with this data really enlightened me on key features of wine and how they correlate to wine quality. The negative correlation between volatile acidity and wine quality was a fun discovery, while the high positive correlation between alcohol content and quality was a little surprising. I don’t think it is possible to taste an increase in alcohol content in wine, so this correlation was interesting. I suspect there are other variables, not included in this data set, that might explain this positive correlation a little better.
My biggest struggle was finding quality multivariate plots to explore. The lack of categorical variables and of input variables with high correlation coefficients meant I needed to do a lot of tests to find plots that “told a story” around wine quality. Another big struggle was refining and embellishing the plots for the final plot sections. I spent a lot of time trying to get the third final plot to look right and make sense.
In terms of successes, I really think my updated plot matrix using corrplot does an excellent job of giving a comprehensive overview of correlation coefficients across all variables. As I discussed earlier in the report, I like seeing all the variable relationships laid out in a single visual. But I find the GGally plot matrix too busy, and the plots in the lower half too small and cramped to really get much out of. The final corrplot matrix allows for a really quick and easy way to identify positive or negative correlations (colors), their exact levels and whether they are statistically significant. Developing this early in a data analysis project can help to expedite the bivariate analysis process.
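The idea behind this plot, jittered points plus one regression line per quality group, can be sketched in base R as follows. Everything here is simulated stand-in data with assumed group means (not the wine file), and the colors and object names are my own choices. The report's actual figure was presumably built differently.

```r
set.seed(5)

# Simulated stand-in data: three quality groups with different acidity levels.
n <- 300
grp <- factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
              levels = c("Low", "Medium", "High"))
alcohol <- round(runif(n, 9, 13), 1)  # rounded values overplot, like the real data
base <- c(Low = 0.75, Medium = 0.55, High = 0.40)
d <- data.frame(
  alcohol          = alcohol,
  volatile.acidity = unname(base[as.character(grp)]) + rnorm(n, sd = 0.1),
  quality_scale    = grp
)

cols <- c(Low = "red", Medium = "grey40", High = "blue")

# jitter() adds a little random noise so stacked points become visible.
plot(jitter(d$alcohol), d$volatile.acidity,
     col = cols[as.character(d$quality_scale)], pch = 16,
     xlab = "alcohol (% by volume)", ylab = "volatile acidity (g / dm^3)")

# One linear fit per quality group, drawn over the points.
fits <- lapply(split(d, d$quality_scale),
               function(g) lm(volatile.acidity ~ alcohol, data = g))
for (k in names(fits)) abline(fits[[k]], col = cols[k], lwd = 2)
legend("topright", legend = names(cols), col = cols, lwd = 2)

# Slope of each group's line.
slopes <- sapply(fits, function(m) coef(m)[["alcohol"]])
slopes
```

Fitting on the unjittered values and jittering only the plotted points keeps the regression lines unaffected by the added noise.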
As I mentioned before, I would really like a larger data set with more variables and observations. One of the data set options for this project was a white wine version. I think a fun future project would be to combine the two data sets and see if the same variables have similar correlations for the two wine types. I suspect there are certain variables that have strong or negative correlations for wine quality among white and red wines. Recreating some of my plots with a facet wrap for wine type (red v. white) would be really interesting.